1 Introduction

Our client, a hypothetical pharmaceutical company, wants to better understand data relating individuals (clients and potential clients) to various health conditions and miscellaneous attributes. The goal is to extract meaningful information that could guide future research and support the company's rapidly expanding business and market share, while focusing on and improving the wellbeing of its clients.

1.1 Terms and definitions

  • Individual - A person who has been surveyed by the NHANES (National Health and Nutrition Examination Survey) for various attributes in the following areas: demographics, examinations, diet, questionnaire (medical history), and medication.

  • Health Conditions - Various diseases or ailments that people may exhibit, such as sleep disorders, diabetes, oral health problems, and high cholesterol.

  • The National Health and Nutrition Examination Survey (NHANES) - A program of studies designed to assess the health and nutritional status of adults and children in the United States.

1.2 Data Description

The data gathered is spread across six distinct files (CSV format): Demographics, Examinations, Dietary, Laboratory, Questionnaire, and Medication.

2 Business Case

Our client wants to develop new drugs whose primary intent is to improve the quality of life of the individuals surveyed. The company is interested in whether existing data on subjects and their associated health conditions could provide advice and insight to its researchers. It has obtained the NHANES dataset and requested our assistance in performing the intended analysis. This dataset contains individual-level data along with various information, including health conditions.

The company is interested in developing new drugs for the following health conditions: diabetes, hypertension (blood pressure), and cancer.

The company, aware of our Machine Learning skills, approached us for help on the following problems:

2.1 First Problem: Reduce lab/data collection costs

Within the healthcare dataset, the business has noted there are thousands of attributes, and many values are missing throughout. The business has a lot of old trial data and would like to enrol more patients in its diabetes drug trials, but it does not want to spend too much on finding new candidates. It is unsure which attributes are the most meaningful in relation to diabetes. It is also in the middle of cancer trials and is looking for possible future referrals for its diabetes trials.

  1. Could a smaller subset of the data help tell who has diabetes? If so, data collection could be refined to capture only those elements.

  2. Are there insights that could be gained from the demographics data in relation to diseases?

The company has also asked us about possibly wrapping the model in a robust, easy-to-use app that could be presented to management and corporate to assist with decision making, based on a few user inputs.

2.2 Second Problem: Marketing

The marketing department is struggling with the high cost of television advertisements and is interested in ways to reduce costs while still hitting its target markets, both for the advertisement of drugs and for attracting candidates for trials.

3 Analytical Reframing for the Business Case

To address the first business problem, we will apply supervised and unsupervised machine learning. From the thousands of attributes across the dataset, we will reduce the dimensionality of the data (PCA) and apply supervised machine learning algorithms to predict who has diabetes (a); this should improve the pharmaceutical company's ability to find referrals using current or past trial data. The key will be to use as few attributes as possible in order to maximize the model's portability. Secondly, we will use an unsupervised clustering approach on the demographics data to explore whether it shows any significant findings for the company (b).

The second business problem involves using “health condition” features and finding related features. We will apply two types of unsupervised machine learning to address it. First, we will use an association rule learning method to discover which attributes are associated with health conditions (a), borrowing an approach traditionally used for market basket analysis.
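As a minimal illustration of the market-basket idea, the three core metrics (support, confidence, lift) of a candidate rule can be computed directly in base R on toy binary condition flags; the column names below are made up for this sketch, and a dedicated package such as arules (not among the packages loaded later) would be used for the full analysis:

```r
# Toy binary indicators (hypothetical columns, not real NHANES names)
df <- data.frame(
  high_bmi = c(1, 1, 1, 1, 0, 0, 0, 0, 1, 1),
  diabetes = c(1, 1, 0, 0, 0, 0, 0, 1, 1, 0)
)

# Metrics for the candidate rule {high_bmi} => {diabetes}
support    <- mean(df$high_bmi == 1 & df$diabetes == 1)  # P(both)
confidence <- support / mean(df$high_bmi == 1)           # P(diabetes | high_bmi)
lift       <- confidence / mean(df$diabetes == 1)        # association strength
c(support = support, confidence = confidence, lift = lift)
# -> support 0.30, confidence 0.50, lift 1.25
```

A lift above 1 indicates the antecedent and consequent co-occur more often than chance, which is exactly the signal we want when linking attributes to health conditions.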

Secondly, we will use unsupervised clustering techniques to look for meaningful insights in the data (b).
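A minimal sketch of the clustering step, using base R's kmeans on standardized toy numeric data (the real analysis would substitute the demographics features and choose k with a method such as factoextra's elbow plot):

```r
set.seed(42)
# Toy numeric matrix standing in for the demographic features:
# two well-separated groups of 50 observations each
demo <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
              matrix(rnorm(100, mean = 4), ncol = 2))

# Standardize before clustering so all features weigh equally
km <- kmeans(scale(demo), centers = 2, nstart = 25)
km$size   # observations per cluster
```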

Questions to consider during work:

  • According to their health conditions, can subjects be divided into discrete groups that could provide meaningful data for the drug researchers?
  • Can this information be useful to reduce the costs of clinical trials?

If this is the case, we need to find clusters of subjects that segregate the data by health conditions and report these findings to the business.

  • What features are associated with “health condition” features?
  • To answer this, we compare the rows (subjects) of the dataset.

3.1 How do we define “health condition” features within the dataset?

For our problems, we will need to see which attributes are tied to the “health condition” features. To achieve this, we assume that the following columns/features of the Questionnaire dataset indicate that an individual has a “health condition”:

3.1.1 Diabetes

DIQ010 - Doctor told you have diabetes (https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DIQ_H.htm). Survey wording: “The next questions are about specific medical conditions. {Other than during pregnancy, {have you/has SP}/{Have you/Has SP}} ever been told by a doctor or health professional that {you have/{he/she/SP} has} diabetes or sugar diabetes?”

3.1.2 Hypertension (blood pressure)

BPQ020 - Ever told you had high blood pressure (https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/BPQ_H.htm). Survey wording: “{Have you/Has SP} ever been told by a doctor or other health professional that {you/s/he} had hypertension, also called high blood pressure?”

3.1.3 Cancer

MCQ220 - Ever told you had cancer or malignancy (https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/MCQ_H.htm#MCQ220). Survey wording: “{Have you/Has SP} ever been told by a doctor or other health professional that {you/s/he} had cancer or a malignancy (ma-lig-nan-see) of any kind?”

4 Loading R packages

library(plyr)          # data manipulation (loaded before dplyr)
library(dplyr)         # data manipulation
library(tidyr)         # data tidying
library(ggplot2)       # visualization
library(knitr)         # report generation
library(mice)          # multivariate imputation by chained equations
library(scales)        # axis and scale formatting
library(randomForest)  # random forest models
library(psych)         # descriptive statistics
library(factoextra)    # PCA and cluster visualization
library(RColorBrewer)  # color palettes
library(caret)         # ML utilities (nearZeroVar, model training)
library(plotly)        # interactive plots
library(AMR)           # health/antimicrobial data utilities

5 Data cleaning

As indicated earlier, the dataset consists of six raw data files: Demographics, Examinations, Dietary, Laboratory, Questionnaire, and Medication. The largest dataset, in terms of attributes, contains 953 variables, while the smallest one contains 47 variables.

Because this is a large amount of data, with over a thousand attributes cumulatively, we decided to employ the following guidelines to reduce the complexity of the data:

  • If more than 25% of the values are missing for an attribute (column), we will consider removing the column from further evaluation.
  • If the majority of attributes in a given dataset are missing 25% or more of their values, we will use business and expert judgement to subjectively select a smaller subset of attributes of interest and decide how to impute their missing values.
  • We will also use discretion where appropriate.

Ideally, we would like to analyze and impute every attribute with missing values, but in this situation, it may not be practical due to the large volume of missing data.
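The 25% rule above can be applied mechanically. A base-R sketch, on a hypothetical dataframe df:

```r
# Example dataframe with varying missingness
df <- data.frame(
  a = c(1, 2, NA, 4),   # 25% missing -> kept (rule is "more than 25%")
  b = c(NA, NA, 3, 4),  # 50% missing -> dropped
  c = 1:4               # complete    -> kept
)

miss_frac <- colMeans(is.na(df))          # fraction missing per column
keep      <- names(df)[miss_frac <= 0.25] # columns passing the threshold
df_kept   <- df[, keep, drop = FALSE]
names(df_kept)
# -> "a" "c"
```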

5.1 Checking for missing data

It is always essential to check for missing values and consider how to address them in the model.

We highlight the Demographic and Diet datasets here, as they are mostly complete.

We found that the percentage of missing data in four of the six spreadsheets is very significant. Almost all attributes/columns have varying degrees of missing values.

5.1.1 Demographic

5.1.2 Diet

5.1.3 Examination

5.1.4 Labs

5.1.5 Medications

5.1.6 Questionnaire

5.2 Data Imputation

As per our guidelines, we will select attributes/columns of interest based on our business/personal judgements. The full NHANES data dictionary/variable list is available at the following URL:

https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2013

5.2.1 Demographic

We first remove the variables with near-zero variance in the dataset. Later, we will remove the variables with more than 25% missing values from the Demographics dataset.
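Since caret is already loaded, the near-zero-variance step can be sketched with caret::nearZeroVar on toy data (the cutoffs in play here are caret's defaults, not necessarily the ones used in our actual run):

```r
library(caret)

# Toy data: x2 is almost constant and should be flagged
toy <- data.frame(
  x1 = rnorm(100),
  x2 = c(rep(0, 99), 1)
)

nzv <- nearZeroVar(toy)  # indices of near-zero-variance columns
toy_reduced <- if (length(nzv) > 0) toy[, -nzv, drop = FALSE] else toy
names(toy_reduced)
# -> "x1"
```

The guard on length(nzv) matters: indexing with -integer(0) in R would drop every column, so we only subset when something was actually flagged.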

We will now use our dictionary to build a reference dataframe that distinguishes the different forms of variables quickly and effectively:

Categorization of variables

We now enter the categorization of each variable as Factor / Numeric / ‘Computation not required’ in the generated Excel file:

  • Only the 3rd column is to be filled in.
  • The codes are: 0 = factor requiring no computation; 1 = numeric requiring computation; 2 = factor requiring computation.
  • The category column should be named “Cat”.

Reading the index again

Now we prepare the dataset for imputation using all this information.

Imputation (MICE package). The imputation methods used by this package are:

  • pmm (predictive mean matching) - numeric variables
  • logreg (logistic regression) - binary factor variables (2 levels)
  • polyreg (Bayesian polytomous regression) - unordered factor variables (> 2 levels)
  • polr (proportional odds model) - ordered factor variables (> 2 levels)
  • cart (classification and regression trees) - any type
  • rf (random forest imputation) - any type

Imputation and save results:
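A minimal sketch of the imputation call on toy numeric data (the real run passes the per-column method vector built from the categorization above instead of a single method):

```r
library(mice)

# Toy data with missing values in two numeric columns
set.seed(7)
df <- data.frame(
  age = c(25, NA, 40, 33, NA, 51),
  bmi = c(22.1, 27.4, NA, 30.2, 24.8, NA)
)

# One completed dataset, predictive mean matching for all columns
imp    <- mice(df, m = 1, method = "pmm", maxit = 5, printFlag = FALSE)
df_imp <- complete(imp)   # extract the completed dataset
anyNA(df_imp)             # no missing values remain
```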

We have selected the following 8 relevant columns among the 32 that have less than 25% of missing values:

Now we will label the dataset for visualizations.

5.2.2 Diet

First, we remove all near-zero-variance features from the dataset, with a cutoff of 45%:

Now, we remove the features with more than 25% missing values, as decided before:

We have selected the following 69 relevant columns among the 88 that have less than 25% of missing values:

  • ID (SEQN) - Respondent sequence number.
  • Number_of_foods (DR1TNUMF) - Total number of foods/beverages reported in the individual foods file.
  • Energy_kcal (DR1TKCAL) - Energy (kcal).
  • Protein (DR1TPROT) - Protein (gm).
  • Carbohydrate (DR1TCARB) - Carbohydrate (gm).
  • Sugar (DR1TSUGR) - Total sugars (gm).
  • Fiber (DR1TFIBE) - Dietary fiber (gm).
  • Total_fat (DR1TTFAT) - Total fat (gm).
  • Sat_fats (DR1TSFAT) - Total saturated fatty acids (gm).
  • Mono_fats (DR1TMFAT) - Total monounsaturated fatty acids (gm).
  • Poly_fats (DR1TPFAT) - Total polyunsaturated fatty acids (gm).
  • Cholesterol (DR1TCHOL) - Cholesterol (mg).
  • Tocopherol (DR1TATOC) - Vitamin E as alpha-tocopherol (mg).
  • Tocopherol_alpha (DR1TATOA) - Added alpha-tocopherol (Vitamin E) (mg).
  • Retinol (DR1TRET) - Retinol (mcg).
  • Retinol_a (DR1TVARA) - Vitamin A as retinol activity equivalents (mcg).
  • Carotene_alpha (DR1TACAR) - Alpha-carotene (mcg).
  • Carotene_beta (DR1TBCAR) - Beta-carotene (mcg).
  • Cryptoxanthin (DR1TCRYP) - Beta-cryptoxanthin (mcg).
  • Lycopene (DR1TLYCO) - Lycopene (mcg).
  • Lutein_zeaxanthin (DR1TLZ) - Lutein + zeaxanthin (mcg).
  • Thiamin (DR1TVB1) - Thiamin (Vitamin B1) (mg).
  • Riboflavin (DR1TVB2) - Riboflavin (Vitamin B2) (mg).
  • Niacin (DR1TNIAC) - Niacin (mg).
  • Vitamin_b6 (DR1TVB6) - Vitamin B6 (mg).
  • Folate (DR1TFOLA) - Total folate (mcg).
  • Folic_acid (DR1TFA) - Folic acid (mcg).
  • Folate_food (DR1TFF) - Food folate (mcg).
  • Folate_equivs (DR1TFDFE) - Folate as dietary folate equivalents (mcg).
  • Choline (DR1TCHL) - Total choline (mg).
  • Vitamine_b12 (DR1TVB12) - Vitamin B12 (mcg).
  • Vitamine_b12_added (DR1TB12A) - Added vitamin B12 (mcg).
  • Vitamine_c (DR1TVC) - Vitamin C (mg).
  • Vitamine_d (DR1TVD) - Vitamin D (D2 + D3) (mcg).
  • Vitamine_k (DR1TVK) - Vitamin K (mcg).
  • Calcium (DR1TCALC) - Calcium (mg).
  • Phosphorus (DR1TPHOS) - Phosphorus (mg).
  • Magnesium (DR1TMAGN) - Magnesium (mg).
  • Iron (DR1TIRON) - Iron (mg).
  • Zinc (DR1TZINC) - Zinc (mg).
  • Copper (DR1TCOPP) - Copper (mg).
  • Sodium (DR1TSODI) - Sodium (mg).
  • Potassium (DR1TPOTA) - Potassium (mg).
  • Selenium (DR1TSELE) - Selenium (mcg).
  • Caffeine (DR1TCAFF) - Caffeine (mg).
  • Theobromine (DR1TTHEO) - Theobromine (mg).
  • Alcohol (DR1TALCO) - Alcohol (gm).
  • Moisture (DR1TMOIS) - Moisture (gm).
  • Butanoic (DR1TS040) - SFA 4:0 (Butanoic) (gm).
  • Hexanoic (DR1TS060) - SFA 6:0 (Hexanoic) (gm).
  • Octanoic (DR1TS080) - SFA 8:0 (Octanoic) (gm).
  • Decanoic (DR1TS100) - SFA 10:0 (Decanoic) (gm).
  • Dodecanoic (DR1TS120) - SFA 12:0 (Dodecanoic) (gm).
  • Tetradecanoic (DR1TS140) - SFA 14:0 (Tetradecanoic) (gm).
  • Hexadecanoic (DR1TS160) - SFA 16:0 (Hexadecanoic) (gm).
  • Octadecanoic (DR1TS180) - SFA 18:0 (Octadecanoic) (gm).
  • Hexadecenoic (DR1TM161) - MFA 16:1 (Hexadecenoic) (gm).
  • Octadecenoic (DR1TM181) - MFA 18:1 (Octadecenoic) (gm).
  • Eicosenoic (DR1TM201) - MFA 20:1 (Eicosenoic) (gm).
  • Docosenoic (DR1TM221) - MFA 22:1 (Docosenoic) (gm).
  • Octadecadienoic (DR1TP182) - PFA 18:2 (Octadecadienoic) (gm).
  • Octadecatrienoic (DR1TP183) - PFA 18:3 (Octadecatrienoic) (gm).
  • Octadecatetraenoic (DR1TP184) - PFA 18:4 (Octadecatetraenoic) (gm).
  • Eicosatetraenoic (DR1TP204) - PFA 20:4 (Eicosatetraenoic) (gm).
  • Eicosapentaenoic (DR1TP205) - PFA 20:5 (Eicosapentaenoic) (gm).
  • Docosapentaenoic (DR1TP225) - PFA 22:5 (Docosapentaenoic) (gm).
  • Docosahexaenoic (DR1TP226) - PFA 22:6 (Docosahexaenoic) (gm).
  • Food_yesterday (DR1_300) - Was the amount of food that {you/NAME} ate yesterday much more than usual, usual, or much less than usual?.
  • Water_yesterday (DR1_320Z) - Total plain water drank yesterday - including plain tap water, water from a drinking fountain, water from a water cooler, bottled water, and spring water..

As for the Demographic dataset, we build a reference dataframe from the dictionary, categorize each variable, and impute the missing values with the MICE package.

Labeling the dataset:

5.2.3 Examination

First, we remove all near-zero-variance features from the dataset, with a cutoff of 45%:

Now, we remove the features with more than 25% missing values, as decided before:

We have selected the following 12 relevant columns among the 105 that have less than 25% of missing values:

  • ID (SEQN) - Respondent sequence number.
  • BP_test_time_exam (PEASCTM1) - Blood Pressure Time in Seconds.
  • BP_arm_exam (BPAARM) - Arm selected:.
  • BP_Systolic_exam (BPXSY2) - Systolic: Blood pressure (second reading) mm Hg.
  • BP_Diastolic_exam (BPXDI2) - Diastolic: Blood pressure (second reading) mm Hg.
  • Weight_exam (BMXWT) - Weight (kg).
  • Height_exam (BMXHT) - Standing Height (cm).
  • BMI_exam (BMXBMI) - Body Mass Index (kg/m**2).
  • Leg_length_exam (BMXLEG) - Upper Leg Length (cm).
  • Waist_circumference_exam (BMXWAIST) - Waist Circumference (cm).
  • Dominant_hand_exam (MGD130) - Are you right-handed, left-handed, or do you use both hands equally?.
  • Grip_strength_exam (MGDCGSZ) - Combined grip strength (kg): the sum of the largest reading from each hand..

As for the Demographic dataset, we build a reference dataframe from the dictionary, categorize each variable, and impute the missing values with the MICE package.

Labeling the dataset:

5.2.4 Labs

First, we remove all near-zero-variance features from the dataset, with a cutoff of 45%:

Now, we remove the features with more than 25% missing values, as decided before:

We have selected the following 9 relevant columns among the 46 that have less than 25% of missing values:

  • ID (SEQN) - Respondent sequence number.
  • White_blood_cells_labs (LBXWBCSI) - White blood cell count (1000 cells/uL).
  • Red_bloods_cells_labs (LBXRBCSI) - Red blood cell count (million cells/uL).
  • Caffeine_labs (PHQ020) - Coffee or tea with cream or sugar? [Include milk or non-dairy creamers.].
  • Alcohol_labs (PHQ030) - Alcohol, such as beer, wine, or liquor?.
  • Supplements_labs (PHQ060) - Dietary supplements such as vitamins and minerals? [Include multivitamins and single nutrient supplements.].
  • Hepatitis_a_labs (LBXHA) - Hepatitis A antibody.
  • Hepatitis_b_labs (LBXHBC) - Hepatitis B core antibody.
  • Cholesterol_labs (LBXTC) - Total Cholesterol (mg/dL).

As for the Demographic dataset, we build a reference dataframe from the dictionary, categorize each variable, and impute the missing values with the MICE package.

Labeling the dataset:

5.2.5 Medication

First, we remove all near-zero-variance features from the dataset, with a cutoff of 45%:

Now, we remove the features with more than 32% missing values (the 25% guideline had to be relaxed for this dataset):

All of the columns had more than 25% missing values. Among the 8 columns with less than 32% missing values, we selected the following 5 relevant columns:

  • ID (SEQN) - Respondent sequence number.
  • Medication (RXDDRGID) - Generic drug name.
  • Days_taken (RXDDAYS) - Number of days taken medicine.
  • Medication_purpose (RXDRSD1) - Reasons for use reported by participants.
  • Amount_taken (RXDCOUNT) - Number of prescription medicines taken.

As for the Demographic dataset, we build a reference dataframe from the dictionary, categorize each variable, and impute the missing values with the MICE package.

Labeling the dataset:

5.2.6 Questionnaire

First, we will remove the near-zero-variance variables.

Now, we remove the features with more than 25% missing values, as decided before:

We have selected the following 38 relevant columns among the 79 that have less than 25% of missing values:

  • ID (SEQN) - Respondent sequence number.
  • Spent_total (CBD070) - The next questions are about how much money {your family spends/you spend} on food. First I’ll ask you about money spent at supermarkets or grocery stores. Then we will talk about money spent at other types of stores. During the past 30 days, how much money {did your family/did you} spend at supermarkets or grocery stores? Please include purchases made with food stamps..
  • Spent_groceries (CBD110) - About how much money {did your family/did you} spend on food at these types of stores? (Please do not include any stores you have already told me about.).
  • Spent_vending (CBD120) - During the past 30 days, how much money {did your family/did you} spend on eating out? Please include money spent in cafeterias at work or at school or on vending machines, for all family members..
  • Spent_delivered (CBD130) - During the past 30 days, how much money {did your family/did you} spend on food carried out or delivered? Please do not include money you have already told me about..
  • Cold_30 (HSQ500) - Did {you/SP} have a head cold or chest cold that started during those 30 days?.
  • Stomach_30 (HSQ510) - Did {you/SP} have a stomach or intestinal illness with vomiting or diarrhea that started during those 30 days?.
  • Flu_30 (HSQ520) - Did {you/SP} have flu, pneumonia, or ear infections that started during those 30 days?.
  • Diabetes (DIQ010) - The next questions are about specific medical conditions. {Other than during pregnancy, {have you/has SP}/{Have you/Has SP}} ever been told by a doctor or health professional that {you have/{he/she/SP} has} diabetes or sugar diabetes?.
  • Taking_insuline (DIQ050) - {Is SP/Are you} now taking insulin.
  • Milk_30 (DBQ197) - Now I’m going to ask a few questions about milk products. Do not include their use in cooking. In the past 30 days, how often did {you/SP} have milk to drink or on {your/his/her} cereal? Please include chocolate and other flavored milks as well as hot cocoa made with milk. Do not count small amounts of milk added to coffee or tea. Would you say….
  • Meals_outside (DBD895) - Next I’m going to ask you about meals. By meal, I mean breakfast, lunch and dinner. During the past 7 days, how many meals {did you/did SP} get that were prepared away from home in places such as restaurants, fast food places, food stands, grocery stores, or from vending machines? {Please do not include meals provided as part of the school lunch or school breakfast./Please do not include meals provided as part of the community programs you reported earlier.}.
  • Meals_premade (DBD905) - Some grocery stores sell “ready to eat” foods such as salads, soups, chicken, sandwiches and cooked vegetables in their salad bars and deli counters. During the past 30 days, how often did {you/SP} eat “ready to eat” foods from the grocery store? Please do not include sliced meat or cheese you buy for sandwiches and frozen or canned foods..
  • Meals_frozen (DBD910) - During the past 30 days, how often did you {SP} eat frozen meals or frozen pizzas? Here are some examples of frozen meals and frozen pizzas..
  • Deafness (DLQ010) - With this next set of questions, we want to learn about people who have physical, mental, or emotional conditions that cause serious difficulties with their daily activities. Though different, these questions may sound similar to ones I asked earlier. {Are you/Is SP} deaf or {do you/does he/does she} have serious difficulty hearing?.
  • Blindness (DLQ020) - {Are you/Is SP} blind or {do you/does he/does she} have serious difficulty seeing even when wearing glasses?.
  • Forgetfulness (DLQ040) - Because of a physical, mental, or emotional condition, {do you/does he/does she} have serious difficulty concentrating, remembering, or making decisions?.
  • Food_assistance (FSD151) - In the last 12 months, did {you/you or any member of your household} ever get emergency food from a church, a food pantry, or a food bank, or eat in a soup kitchen?.
  • WIC_assistance (FSQ162) - In the last 12 months, did {you/you or any member of your household} receive benefits from the WIC program, that is, the Women, Infants and Children program?.
  • Hepatitis_b (HEQ010) - Has a doctor or other health professional ever told {you/SP} that {you have/s/he/SP has} Hepatitis B? (Hepatitis is a form of liver disease. Hepatitis B is an infection of the liver from the Hepatitis B virus (HBV).).
  • Hepatitis_c (HEQ030) - Has a doctor or other health professional ever told {you/SP} that {you have/s/he/SP has} Hepatitis C? (Hepatitis is a form of liver disease. Hepatitis C is an infection of the liver from the Hepatitis C virus (HCV).).
  • Insurance_current (HIQ011) - The (first/next) questions are about health insurance. {Are you/Is SP} covered by health insurance or some other kind of health care plan? [Include health insurance obtained through employment or purchased directly as well as government programs like Medicare and Medicaid that provide medical care or help pay medical bills.].
  • Insurance_lapse_12 (HIQ210) - In the past 12 months, was there any time when {you/SP} did not have any health insurance coverage?.
  • House_rooms (HOD050) - How many rooms are in this home? Count the kitchen but not the bathroom..
  • Health_current (HUQ010) - {First/Next} I have some general questions about {your/SP’s} health. Would you say {your/SP’s} health in general is . . ..
  • Health_institution (HUQ041) - {What kind of place is it - a clinic, doctor’s office, emergency room, or some other place?} {What kind of place {do you/does SP} go to most often - a clinic, doctor’s office, emergency room, or some other place?}.
  • Doctor_visits_12 (HUQ051) - {During the past 12 months, how/How} many times {have you/has SP} seen a doctor or other health care professional about {your/his/her} health at a doctor’s office, a clinic or some other place? Do not include times {you were/s/he was} hospitalized overnight, visits to hospital emergency rooms, home visits or telephone calls..
  • Health_mental_12 (HUQ090) - During the past 12 months, that is since {DISPLAY CURRENT MONTH} of {DISPLAY LAST YEAR}, {have you/has SP} seen or talked to a mental health professional such as a psychologist, psychiatrist, psychiatric nurse or clinical social worker about {your/his/her} health?.
  • Family_income_mo (IND235) - Monthly family income (reported as a range value in dollars)..
  • Asthma (MCQ010) - The following questions are about different medical conditions. Has a doctor or other health professional ever told {you/SP} that {you have/s/he/SP has} asthma (az-ma)?.
  • Anemia (MCQ053) - During the past 3 months, {have you/has SP} been on treatment for anemia (a-nee-me-a), sometimes called “tired blood” or “low blood”? [Include diet, iron pills, iron shots, transfusions as treatment.].
  • Celiac (MCQ082) - Has a doctor or other health professional ever told {you/SP} that {you have/s/he/SP has} celiac (sele-ak) disease, also called or sprue (sproo)?.
  • Gluten_free_diet (MCQ086) - {Are you/is SP} on a gluten-free diet?.
  • Jaundice (MCQ203) - Has anyone ever told {you/SP} that {you/she/he/SP} had yellow skin, yellow eyes or jaundice? Please do not include infant jaundice, which is common during the first weeks after birth..
  • Asthma_relatives (MCQ300B) - Including living and deceased, were any of {SP’s/your} close biological that is, blood relatives including father, mother, sisters or brothers, ever told by a health professional that they had asthma (az-ma)?.
  • Dentist_visit_since (OHQ030) - The next questions are about {your/SP’s} teeth and gums. About how long has it been since {you/SP} last visited a dentist? Include all types of dentists, such as, orthodontists, oral surgeons, and all other dental specialists, as well as dental hygienists..
  • TV_30 (PAQ710) - Now I will ask you first about TV watching and then about computer use. Over the past 30 days, on average how many hours per day did {you/SP} sit and watch TV or videos? Would you say . . ..
  • Gaming_hours (PAQ715) - Over the past 30 days, on average how many hours per day did {you/SP} use a computer or play computer games outside of school? Include Playstation, Nintendo DS, or other portable video games Would you say . . ..
  • Smoking_relatives (SMD460) - Now I would like to ask you a few questions about smoking in this home. How many people who live here smoke cigarettes, cigars, little cigars, pipes, water pipes, hookah, or any other tobacco product?.
  • Ride_motor_vehicle (SMQ870) - During the last 7 days, did {you/SP} ride in a car or motor vehicle?.

As for the Demographic dataset, we build a reference dataframe from the dictionary, categorize each variable, impute the missing values with the MICE package, and save the results.

Now we label and save the data set:

6 Data Visualization

We perform visualization against the cleaned datasets and against the union of the cleaned datasets.

6.1 Demographics Data

6.1.1 Visualization of key features

Visuals against the cleaned dataset

6.1.1.1 Gender

6.1.1.2 Country of birth

6.1.1.3 Marital status

6.1.1.4 Race

6.1.1.5 Diabetes and Age

6.1.1.6 Number of proxy users

This graph shows the number of proxy users we have in our database:

This graph shows the number of proxy users having Diabetes:

6.1.1.7 Conclusion

Our sample is fairly representative of the US population:

6.2 Visualization for Diet

Visuals against the cleaned dataset

6.3 Visualization for Examination

Visuals against the cleaned dataset

6.4 Visualization for Labs

Visuals against the cleaned dataset

6.5 Visualization for Medication

Visuals against the cleaned dataset

6.6 Visualization for Questionnaire

Visuals against the cleaned dataset

6.7 Visualization for union of datasets

7 Problem 1(a): Supervised Prediction for diabetes

First, our target attributes need to be added to a dataset.

7.1 Create target dataset

As part of the business problem, we focus on three targets (diabetes, hypertension, cancer):

Given that an individual has diabetes, predict whether the individual has cancer or hypertension. Use as little data as possible to keep costs low.

7.2 DIABETES

Marking data for Diabetes

7.2.1 DATA MINING

7.2.1.1 DEMOGRAPHIC DATA MINING FOR DIABETES

We will now keep the features associated with diabetes, using PCA and correlation plots.

Correlation Plot:

PCA

We notice that the first 10 components have an eigenvalue > 1 and explain almost 80% of the variance. So if we reduce dimensionality from 35 to 10, we lose about 20% of the variance.

The first two components explain only 30% of the variance. We need 18 principal components to explain more than 95% of the variance, and 27 to explain more than 99%. Based on the correlation and PCA analysis, we decided to keep the variables selected below.
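The component counts quoted above come from the cumulative explained-variance curve; a minimal sketch of that counting logic (the eigenvalues here are illustrative, not the NHANES ones):

```python
# Sketch: given PCA eigenvalues, how many components are needed to
# reach a target share of the total variance?
def components_for(eigenvalues, target):
    total = sum(eigenvalues)
    cum = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cum += ev
        if cum / total >= target:
            return k
    return len(eigenvalues)

# Toy spectrum (not the NHANES one)
eig = [4.0, 3.0, 1.5, 1.0, 0.3, 0.2]
k90 = components_for(eig, 0.9)
```

The same function, applied at thresholds 0.95 and 0.99, yields the "18 components for 95%" style statements used throughout this section.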

7.2.1.2 DIET DATA MINING FOR DIABETES

We will now keep the features associated with diabetes, using PCA and correlation plots.

Correlation Plot:

PCA

We notice that the first 24 components have an eigenvalue > 1 and explain almost 90% of the variance. So if we reduce dimensionality from 87 to 24, we lose about 10% of the variance.

The first two components explain only 35% of the variance. We need 27 principal components to explain more than 95% of the variance, and 35 to explain more than 99%. Based on the correlation and PCA analysis, we decided to keep the 13 variables selected below.

7.2.1.3 EXAMS DATA MINING FOR DIABETES

We will now keep the features associated with diabetes, using PCA and correlation plots.

Correlation Plot:

PCA

We notice that the first 14 components have an eigenvalue > 1 and explain almost 75% of the variance. So if we reduce dimensionality from 97 to 14, we lose about 25% of the variance.

The first two components explain only 40% of the variance. We need 35 principal components to explain more than 95% of the variance, and 42 to explain more than 99%. Based on the correlation and PCA analysis, we decided to keep the 31 variables selected below.

7.2.1.4 LABS DATA MINING FOR DIABETES

We will now keep the features associated with diabetes, using PCA and correlation plots.

Correlation Plot:

PCA

We notice that the first 24 components have an eigenvalue > 1 and explain almost 70% of the variance. So if we reduce dimensionality from 77 to 24, we lose about 30% of the variance.

The first two components explain only 20% of the variance. We need 22 principal components to explain more than 80% of the variance, and 37 to explain more than 99%. Based on the correlation and PCA analysis, we decided to keep the 19 variables selected below.

7.2.1.5 MEDICATIONS DATA MINING FOR DIABETES

We will run an MFA (Multiple Factor Analysis) to find relations among the features for data reduction.

We notice that the component dominated by RXDUSE explains almost 75% of the variance relative to the 5 other components. So if we reduce dimensionality from 9 to 1, we lose about 25% of the variance. We analyzed each feature in each dimension and found that the only feature with notably high variance is RXDUSE.

7.2.1.6 QUESTIONNAIRE DATA MINING FOR DIABETES

We will now keep the features associated with diabetes, using PCA and correlation plots.

Correlation Plot:

PCA

We notice that the first 22 components have an eigenvalue > 1 and explain almost 70% of the variance. So if we reduce dimensionality from 75 to 22, we lose about 30% of the variance.

The first two components explain only 35% of the variance. We need 35 principal components to explain more than 95% of the variance, and 38 to explain more than 99%. Based on the correlation and PCA analysis, we decided to keep the 15 variables selected below.

7.2.1.7 Combined Data Mining

We will now keep the features associated with diabetes, using PCA and correlation plots.

Correlation Plot:

PCA

We select the features correlated with the target (HAS_DIABETES) using abs(coefficient) > 0.1.
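That filter is just a Pearson-correlation threshold; a self-contained sketch (Python for illustration; the feature values below are toy data):

```python
# Sketch of the correlation filter: keep features whose absolute Pearson
# correlation with the HAS_DIABETES target exceeds 0.1.
import statistics

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

target = [0, 0, 1, 1, 1, 0]           # toy HAS_DIABETES labels
features = {
    "LBXGH": [5.1, 5.3, 7.8, 8.1, 6.9, 5.2],   # tracks the target
    "NOISE": [1.0, 2.0, 1.5, 1.2, 1.8, 1.4],   # unrelated
}
kept = [n for n, v in features.items() if abs(pearson(v, target)) > 0.1]
```

On real data this loop would run over every candidate column, and the 0.1 cutoff trades a larger feature set against lab-collection cost, which is the business constraint here.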

7.2.1.8 PCA FOR DIABETES

We notice that the first 9 components have an eigenvalue > 1 and explain almost 80% of the variance. So if we reduce dimensionality from 35 to 9, we lose about 20% of the variance.

The first two components explain only 30% of the variance. We need 18 principal components to explain more than 95% of the variance, and 27 to explain more than 99%.

7.2.2 Applying machine learning models

We are going to create training and test sets from this data:
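A minimal sketch of the split (Python for illustration; the 25% test fraction and seed are examples, not necessarily the split used in the report):

```python
# Sketch: shuffle row indices with a fixed seed, then slice off a
# held-out test fraction.
import random

def train_test_split(rows, test_frac=0.25, seed=42):
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(len(rows) * test_frac)
    test = [rows[i] for i in idx[:cut]]
    train = [rows[i] for i in idx[cut:]]
    return train, test

train, test = train_test_split(list(range(100)), test_frac=0.25)
```

Fixing the seed keeps the split reproducible, so the model comparisons later in this section are evaluated on the same held-out rows.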

7.2.2.1 Logistic Regression

Let’s try Logistic Regression:

7.2.2.2 Logistic Regression with PCA

Logistic regression on the PCA components:

7.2.2.3 Logistic Regression with correlated features

Let’s try logistic regression with the top five features correlated to the target (HAS_DIABETES):

1 “LBXGH”: “Glycohemoglobin (%)”

2 “LBXSGL”: “Glucose, refrigerated serum (mg/dL)”

3 “RIDAGEYR”: “Age in years of the participant”

4 “RXDUSE” : “In the past 30 days, have you taken medication for which a prescription is needed?”

5 “LBDHDDSI”: “Direct HDL-Cholesterol (mmol/L)”

7.2.2.4 Random Forest

Let’s try random forest:

7.2.2.5 Random Forest with PCA

Random forest on the PCA components:

7.2.2.6 Random Forest with correlated features

Let’s try random forest with the top five features correlated to the target (HAS_DIABETES):

1 “LBXGH”: “Glycohemoglobin (%)”

2 “LBXSGL”: “Glucose, refrigerated serum (mg/dL)”

3 “RIDAGEYR”: “Age in years of the participant”

4 “RXDUSE” : “In the past 30 days, have you taken medication for which a prescription is needed?”

5 “LBDHDDSI”: “Direct HDL-Cholesterol (mmol/L)”

7.2.2.7 KNN

Let’s try a KNN model:

7.2.2.8 SVM with radial kernel

7.2.3 Model result comparison

Let’s compare the models and check their correlation:

7.2.3.1 Correlation between models

7.2.3.1.1 Plot

7.2.3.1.2 Data

7.2.3.2 Comparison

Most of the models show low variability with respect to the processed sample. The random forests (RF, PCA_RF, and CORR_RF) achieve a high AUC with very low variability.

Let’s recall how these models perform on the test dataset. Predicted classes are obtained by default with a threshold of 0.5, which may not be the best choice for an unbalanced dataset like this one.

The best sensitivity (detection of diabetes) is achieved by the random forest with the top five correlated features, and the PCA variant also has an excellent F1 score.
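The threshold’s effect on these metrics can be sketched directly (Python for illustration; the labels and probabilities below are made up):

```python
# Sketch of why the 0.5 cutoff matters on unbalanced data: sensitivity
# and F1 recomputed from predicted probabilities at a chosen threshold.
def sensitivity_f1(y_true, probs, threshold=0.5):
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    sens = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    return sens, f1

y = [1, 1, 1, 0, 0, 0, 0, 0]                      # few positives
p = [0.9, 0.6, 0.4, 0.3, 0.2, 0.1, 0.1, 0.05]
sens_default, _ = sensitivity_f1(y, p, 0.5)       # misses the 0.4 case
sens_lowered, _ = sensitivity_f1(y, p, 0.35)      # catches it
```

Lowering the threshold trades precision for sensitivity, which may be the right trade when missing a diabetes case is costlier than a false alarm.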

7.2.4 Conclusions

We found that the random forest using the top five features correlated to the target (HAS_DIABETES) performs well on the test set, with a sensitivity of 0.997 and an F1 score of 0.998.

7.3 Shiny App

The Shiny app was built to help predict a patient’s condition based on the selected attributes.

8 Problem 1(b) Unsupervised Cluster Model for Demographic

8.1 Unsupervised Cluster Model for Demographic

8.1.1 PCA

From the above graphs, we notice that the first 4 components have an eigenvalue > 1 and explain almost 60% of the variance. We cannot effectively reduce dimensionality from 8 to 4, because we would lose about 40% of the variance.

Using just the first two components, no disease shows separation between sick and healthy people. This clearly indicates that we cannot do classification based only on the demographics data.

8.1.2 k-means Clustering

From the above graph, we conclude that 6 is the appropriate number of clusters, since the bend in the elbow plot appears there.
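The elbow criterion behind this choice compares the within-cluster sum of squares (WSS) across candidate values of k; a toy sketch (Python for illustration, one-dimensional points):

```python
# Sketch of the elbow criterion: total within-cluster sum of squares
# (WSS) for a given clustering; k is chosen where the WSS drop levels off.
def wss(points, labels, centers):
    return sum((p - centers[l]) ** 2 for p, l in zip(points, labels))

pts = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]              # two obvious groups
one_cluster = wss(pts, [0] * 6, [3.0])            # k = 1
two_clusters = wss(pts, [0, 0, 0, 1, 1, 1], [1.0333, 5.0])  # k = 2
```

Plotting WSS against k yields the elbow curve referenced above: the WSS drops sharply until the true group structure is captured, then flattens.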

Now, let us take k = 6 as our optimal number of clusters.

From the above visualization, we observe that in the cluster distribution both males and females have almost the same age range.

9 Problem 2(a): Unsupervised Association

Find associations between diseases and the diet/demographics data, as per the business problem.

9.1 Introduction

Association mining is often used for market basket analysis. Here, for the NHANES healthcare dataset, we will explore associations within the data and attempt to provide value for the pharmaceutical company’s marketing problems: advertising its drugs and attracting individuals to clinical trials.

9.2 Preparing the data for association algorithms.

Our first task is to prepare the data for the association mining algorithms.

The association rules will reference the values of the attributes. If a value simply says “Yes”, its meaning may be ambiguous; if the value is “US Citizen”, the meaning is precise. Below are a couple of examples where we have re-coded the values of attributes:

The above recoding was performed for the 18 attributes we selected for the association dataset. We focused on categorical attributes, since numerical values add little to association mining unless they are binned into categories. The dataset is rich with many more attributes, so in the future more could be added to the association mining algorithms if the business finds value in the suggestions from this type of analysis.
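A minimal sketch of the recoding step (Python for illustration; the value labels are examples, and the NHANES codebook remains the authoritative source for code meanings):

```python
# Sketch of the recoding step: map ambiguous "1"/"2" survey codes to
# self-describing labels so each item in a rule is unambiguous.
RECODE = {
    "DMDCITZN": {"1": "US Citizen", "2": "Not a US Citizen"},
    "DIQ010":   {"1": "Has Diabetes", "2": "No Diabetes"},
}

def recode_row(row):
    """Replace each raw code with its label, leaving unknown values as-is."""
    return {attr: RECODE.get(attr, {}).get(val, val) for attr, val in row.items()}

row = recode_row({"DMDCITZN": "1", "DIQ010": "2"})
```

Centralizing the mapping in one table makes it cheap to extend the recoding to more attributes later, as suggested above.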

9.3 Transform dataset into a transactional dataset

In order to apply association algorithms, the dataset has to be transformed into a transactional dataset. First, we need to merge all categorical values required for mining into a single description attribute:
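A sketch of that transactional transform (Python for illustration; the rows below are toy stand-ins, not the real attributes):

```python
# Sketch of the transactional transform: each survey row becomes a
# "basket" of value labels, ready for an apriori-style miner.
def to_transactions(rows):
    """Drop missing values and collect the remaining labels per row."""
    return [sorted(v for v in row.values() if v is not None) for row in rows]

rows = [
    {"citizen": "US Citizen", "diabetes": "Has Diabetes", "milk": "Drinks milk daily"},
    {"citizen": "US Citizen", "diabetes": None, "milk": "Drinks milk weekly"},
]
transactions = to_transactions(rows)
```

After this step every individual looks like a market basket of labels, which is exactly the input shape association miners expect.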

9.4 Apply Association algorithms to the data

Now that the data is prepared, we can apply the association algorithms.

9.4.1 Frequent values

First, we create association rules against the dataset.

We plot the 20 most frequent values found within the data.

Per the above, as expected, being a US citizen, right-handed, and born in the US are some of the most frequent values. The values for not having diseases are also at the top of the list.

Over 400,000 rules are produced for the entire dataset; let’s take a glance at 5 of them below.

In the above output, we can see different association mining rules for the entire dataset. Each rule has an LHS and an RHS, which express the relation between itemsets (collections of values): the items on the LHS are associated with, and occur together with, the single item on the RHS. We will now proceed to create association rules for having and not having the particular diseases (cancer, diabetes, hypertension). The RHS will be set to the particular health condition/disease, and we will observe what types of associations are discovered on the LHS.

9.4.2 Association rules for having diseases

In order to produce a list of association rules, we had to experiment with the “conf” (confidence) parameter. For example, for positive cancer rules, we had to lower the confidence to 0.4 to produce any rules. For each health condition (disease), we created two sets of rules: the first allows a larger number of items on the LHS (maxlen=15), whereas the second forces the rules to stay short (maxlen=3).
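The confidence values being tuned here reduce to simple counting; a minimal sketch (Python for illustration; the report itself uses the R arules package) of the confidence of one candidate rule:

```python
# Sketch of the arithmetic behind the "conf" parameter: confidence of
# a rule LHS => RHS is P(RHS | LHS) estimated from the transactions.
def confidence(transactions, lhs, rhs):
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= set(t))
    n_both = sum(1 for t in transactions if (lhs | rhs) <= set(t))
    return n_both / n_lhs if n_lhs else 0.0

tx = [
    ["Has Hypertension", "Has Diabetes", "Has Cancer"],
    ["Has Hypertension", "Has Diabetes"],
    ["Has Hypertension"],
    ["No Hypertension"],
]
conf = confidence(tx, ["Has Hypertension", "Has Diabetes"], ["Has Cancer"])
```

At confidence 0.5, this toy rule sits exactly at the cutoff, which illustrates why lowering “conf” below 0.5 was needed to surface rules for rare outcomes like cancer; maxlen simply caps how many items the LHS may contain.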

9.4.2.1 Cancer

For cancer association rules, we will examine both large and small items found in conjunction with an individual having cancer.

First, we inspected the rules where the individual has cancer and observed which large itemsets occur in conjunction with cancer. The confidence level was set to 0.4 for this set of rules, which might be considered low; however, it produced a handful of rules. Of note, those who have cancer are also associated with having hypertension and diabetes. An interesting observation is that drinking milk occurs in multiple rules.

Next, we inspected the rules where the individual has cancer and observed the small itemsets. The confidence level was set even lower to generate results for small itemsets in conjunction with cancer. Again, similar items, such as having diabetes and hypertension, appear in the small itemsets.

In order to build the association mining lists, we had to reduce confidence levels to under 0.5.

9.4.2.2 Diabetes

As with cancer association rules, we will examine both large and small items found in conjunction with an individual having diabetes.

For large itemsets with a positive diabetes result, we were able to increase the confidence level to 0.7, and 32 rules were generated. Of note, a household income between $20,000 and $24,999 and daily/weekly milk consumption appear in several rules. It is also interesting that there are rules where the individual has health insurance coverage, while none of the rules contain the opposite condition of not having coverage.

For small itemsets, all the rules include having cancer in association with diabetes.

9.4.2.3 Hypertension

First, we inspected the association rules with large itemsets for individuals with hypertension. Rules with a confidence level of 1 are found within this itemset. Unlike the previous two health conditions, race appears more prominently within the association rules.

Second, we inspected the association rules with small itemsets for individuals with hypertension. A marital status of “widowed” appears more frequently within the rules than other marital values.

9.4.3 Interactive Scatter Plots

For the rules examined in the previous section, we’ve taken the top 20 and created interactive scatter plots and graphs to visualize the data.

9.4.3.1 Cancer

The following is a scatter graph visualizing the top 20 association rules for cancer with large itemsets. Please note that the points on the graph are interactive: hover over a point to see its association rule.

CANCER (large itemsets)

The following is a scatter graph for visualizing the top 20 association rules for cancer with small itemsets.

CANCER (small itemsets)

9.4.3.2 Diabetes

The following is a scatter plot for visualizing the top 20 association rules for diabetes with large itemsets.

DIABETES (large itemsets)

The following is a scatter plot for visualizing the top 20 association rules for diabetes with small itemsets.

DIABETES (small itemsets)

9.4.3.3 Hypertension

The following is a scatter plot for visualizing the top 20 association rules for hypertension with large itemsets.

HYPERTENSION (large itemsets)

The following is a scatter plot for visualizing the top 20 association rules for hypertension with small itemsets.

HYPERTENSION (small itemsets)

9.4.4 Interactive Graph Visualization

The following graphs are interactive. Hover the cursor over the rule, to see the related values. Hover the cursor over a value, to see the related rules.

9.4.4.1 Cancer

The following is a graph for visualizing the top 20 association rules for cancer with large itemsets.

CANCER (large itemset)

The following is a graph for visualizing the top 20 association rules for cancer with small itemsets.

CANCER (small itemset)

9.4.4.2 Diabetes

The following is a graph for visualizing the top 20 association rules for diabetes with large itemsets.

DIABETES (large itemset)

The following is a graph for visualizing the top 20 association rules for diabetes with small itemsets.

DIABETES (small itemset)

9.4.4.3 Hypertension

The following is a graph for visualizing the top 20 association rules for hypertension with large itemsets.

HYPERTENSION (large itemset)

The following is a graph for visualizing the top 20 association rules for hypertension with small itemsets.

HYPERTENSION (small itemset)

9.4.5 Individual Rule Representation

9.4.5.1 cancer

With the association rules for cancer, we’ve plotted the top 20 values that were represented in the itemsets.

9.4.5.2 diabetes

With the association rules for diabetes, we’ve plotted the top 20 values that were represented in the itemsets.

9.4.5.3 hypertension

With the association rules for hypertension, we’ve plotted the top 20 values that were represented in the itemsets.

9.4.6 Association rules for not having diseases

In the preceding section, we looked at associations between having diseases/health conditions and other values. To complement our findings, we decided to also create association rules for not having the diseases. This might yield beneficial findings and support any findings from the previous association rules involving positive values for diseases.

The following rules were used:

The rules for not having Cancer

Below, we’ve listed the top 20 association rules for not having cancer. Of interest, “no health insurance” and “visiting multiple places for healthcare” appear in many rules that lead to not having cancer.

The rules for not having Diabetes

Below, we’ve listed the top 20 association rules for not having diabetes. There are multiple rules with the value where the individual receives healthcare from various places (as opposed to one location).

The rules for not having Hypertension

Below, we’ve listed the top 20 association rules for not having hypertension. Of interest, values for smokers and non-smokers both appear in the results. Many of the video-game-related values also appear in the rules below.

9.5 Conclusion

As we’ve gathered rules for both having and not having diseases, we’ve attempted to extract insights that could provide business value to the marketing department, as per the initial business problem. Please note that association rules do not establish causation; they only highlight values that appear together, and our conclusions are subjective interpretations of the data.

These association rules show what related items are found in conjunction with having different diseases and health conditions. Below, we will discuss some of our findings:

The value associated with drinking milk multiple times a day or week appears several times for the diabetes and cancer conditions, and the milk values do not appear in the association rules for not having cancer/diabetes. It might therefore be valid to position marketing for cancer/diabetes drugs alongside milk placements. For example, YouTube often places advertisements in pairs, so a cancer drug advertisement could be placed after a milk advertisement. To be clear, we are not suggesting that milk consumption causes cancer; we are only pointing out an association present in the data.

For hypertension, many of the associated values for income are under $24,999. For marketing, placement of billboards in areas where salaries are under $24,999 could be helpful to market drugs towards those with hypertension.

Although our business focus is marketing drugs to patients with cancer, diabetes, and hypertension, we can also look for out-of-the-box solutions. If the business were looking to develop drugs (or supplements) related to the prevention of hypertension, we could use the data to identify associations among audiences that do not already have the disease.

In summary, these are a few of the suggestions that could be derived from the data. We think these suggestions could have value and provide the “so what” for our conclusions.

9.5.1 Limitations of the conclusions

The association models used in the preceding sections contained 18 variables. We will provide our results to the business; however, the association rules could be improved by adding more categorical variables, or numerical variables that have been binned. Recoding and binning values from the raw data increases the overhead of adding more attributes, but if the business is intrigued by the findings, more data can be incorporated into the association ruleset.

Within the rules for not having diseases, the condition of not having medical insurance appears multiple times. Does this indicate that people without medical insurance are truly less associated with these diseases? Recall that the data come from a questionnaire that presupposes the individual has seen a doctor. If an individual has not seen a doctor for a diagnosis because of lacking insurance coverage, they may not have been able to accurately report whether they have a particular disease.

10 Problem 2(b): Unsupervised Clustering problem

10.1 Business case

The marketing department is struggling with the high cost of television advertisements and is interested in ways to reduce costs while still hitting their target markets, both for advertising drugs and for attracting candidates for trials.

10.2 Data preparation

10.2.1 Data labeling and feature selection

We only used the demographics database to avoid potential HIPAA breaches. The features below were selected to assist the marketing department with their market segmentation efforts:

10.2.2 Data cleanup

There were two columns for education: one that breaks down the elementary studies of the participants, and another that more broadly indicates higher levels of education. We are not interested in that level of granularity, so we merged both columns and reduced the number of factors to mean “Highest level of education achieved”. This helped reduce the missing values from over 40% in each column to under 17%.
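A minimal sketch of that merge (the labels and the fallback rule here are hypothetical stand-ins, not the actual NHANES columns):

```python
# Sketch: collapse two education columns into one "highest level
# achieved" factor, preferring the adult-education column when present.
def merge_education(child_level, adult_level):
    if adult_level is not None:
        return adult_level
    if child_level is not None:
        # Any value in the children's column caps out below high school.
        return "Less than high school"
    return None

merged = [merge_education(c, a) for c, a in [
    (None, "College graduate"),
    ("5th grade", None),
    (None, None),
]]
```

A row is left missing only when both source columns are missing, which is what drives the drop in missingness noted above.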

10.2.3 Data binning

For consistency, the Age feature was converted to categorical.
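A sketch of the binning (Python for illustration; the band edges and labels are hypothetical, not necessarily those used in the report):

```python
# Sketch: map continuous age onto categorical bands for clustering.
def age_band(age):
    edges = [(18, "0-17"), (35, "18-34"), (55, "35-54"), (200, "55+")]
    for upper, label in edges:
        if age < upper:
            return label
    return None

bands = [age_band(a) for a in (7, 20, 54, 80)]
```

Once age is banded, every selected feature is categorical, which keeps the distance computation for clustering uniform.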

Check that the features have the appropriate class.

10.2.4 Data imputation

Impute missing values

Check that missing values are below 25%.

10.3 Hierarchical clustering

Hierarchical clustering was chosen because the features are categorical.
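Hierarchical clustering only needs a pairwise distance matrix, and one common dissimilarity for purely categorical features is simple matching (the Gower-style share of disagreeing attributes); a minimal sketch (Python for illustration, hypothetical category values):

```python
# Sketch: simple-matching dissimilarity between two categorical
# profiles - the fraction of attributes on which they disagree.
def matching_distance(a, b):
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)

p1 = ("Male", "Married", "55+")
p2 = ("Male", "Widowed", "55+")
p3 = ("Female", "Never married", "18-34")
d12 = matching_distance(p1, p2)   # differ on one attribute
d13 = matching_distance(p1, p3)   # differ on all three
```

Computing this distance for every pair of individuals yields the matrix that the linkage step of hierarchical clustering consumes.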

10.4 Data visualization

The first plot is a tally of how many observations there are in each cluster. Subsequent plots show the distribution of the features among each cluster. All the plots are shown after the code.

10.5 Conclusion

Although the data appear quite homogeneous, with many of the clusters having similar proportions, two clusters, 7 and 8, encompass more observations. The data from these two clusters would be recommended to the marketing department for further analysis.

11 Models for Shiny

The model used in the Shiny app to predict whether a patient is likely to contract cancer is the random forest, given that it had the best performance.